Loan Approval Prediction

Objective

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Data Dictionary

Import necessary libraries

Read the dataset

The target column, Personal Loan, is in the middle of the dataframe; it is good practice to move the target column to the end of the dataframe.
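A minimal sketch of moving the target column to the end, using a toy frame (the real notebook loads the AllLife dataset; only the column name "Personal Loan" is taken from the text above):

```python
import pandas as pd

# Toy frame for illustration; the real notebook reads the AllLife dataset.
df = pd.DataFrame({"Age": [25, 40], "Personal Loan": [0, 1], "Income": [49, 100]})

# Reorder so that "Personal Loan" is the last column
cols = [c for c in df.columns if c != "Personal Loan"] + ["Personal Loan"]
df = df[cols]
```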

Data Description

Exploratory Data Analysis (EDA)

The dataset has 5000 observations (rows) and 14 attributes (columns)

DataType of each attribute

Presence of missing values in the dataset

There are no missing values in the dataset
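The missing-value check can be sketched as follows (toy frame standing in for the loan dataset):

```python
import pandas as pd

# Toy frame standing in for the loan dataset (assumption)
df = pd.DataFrame({"Age": [25, 40, 35], "Income": [49, 100, 81]})

# Count of missing values per column, and in total
missing_per_column = df.isnull().sum()
total_missing = missing_per_column.sum()
```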

Summary of Attributes

Observations:

Important: The minimum experience of a customer is -3, which indicates the presence of erroneous data, because experience cannot be negative.

Check for duplicate data and, if any is found, remove it.
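A sketch of the duplicate check on a toy frame (the last row duplicates the first):

```python
import pandas as pd

# Toy frame; the last row is an exact duplicate of the first
df = pd.DataFrame({"Age": [25, 40, 25], "Income": [49, 100, 49]})

n_duplicates = df.duplicated().sum()
if n_duplicates > 0:
    df = df.drop_duplicates()
```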

Data Cleaning

From the summary table we can see that the minimum experience is a negative number, which represents bad data and needs to be fixed either by dropping those records or by imputation.

I am going to check the Experience column for negative values.

52 observations of the dataset have negative values in the Experience column.

Out of 52 there are 33 observations with "-1" as experience, 15 with "-2" and 4 with "-3".

I am going to examine the relationship of Experience with other quantitative attributes such as Age, Income, CCAvg, and Mortgage.

Observation:

Final decision for imputation: I will replace each negative Experience value with the median of the positive Experience values associated with the same Age and Education values.

I am going to check whether there are still any records with a negative Experience value.

The minimum value in the Experience column is now 0.00; it was -3.00 before the fix.

Data distribution of Independent Attributes

Age

The above plot shows a frequency (density) curve superimposed on a histogram for the Age attribute.
The distribution is approximately normal.

Experience
Income

Skewness score

The above distribution for the Income attribute is positively skewed (right-skewed : tail goes to the right) with a skewness score of 0.8413.
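The skewness score can be computed with pandas; here is a sketch on a made-up right-skewed sample (the real Income skewness reported above is 0.8413):

```python
import pandas as pd

# Right-skewed toy sample: one large value drags the tail to the right
income = pd.Series([20, 30, 35, 40, 45, 60, 200])
skewness = income.skew()
```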

ZIP Code
Family

Observations:

The above distribution for the CCAvg attribute is highly positively skewed (right-skewed : tail goes to the right) with a skewness score of 1.5984.

Most customers' monthly average spending on credit cards is between $0 and $2,500. Very few customers have a monthly average spending on credit cards of more than $8,000.

Education

There are more Undergraduate-level customers than Graduate and Advanced/Professional customers.

Mortgage

Skewness score

The above distribution for the Mortgage attribute is highly positively skewed (right-skewed : tail goes to the right) with a skewness score of 2.1040.

Most of the customers do not have a mortgage. Among those who do, most have a mortgage amount between $80,000 and $150,000, while very few have a mortgage amount of more than $600,000.

Securities Account

Most customers do not hold a securities account with the bank.

CD Account

The number of customers who use the internet banking facilities provided by the bank is greater than the number who do not.

Credit Card

The number of customers who do not use a credit card issued by the bank is almost double the number who do.

The CreditCard column data follows a Bernoulli distribution.

Important: This is the case for all binary categorical variables in the dataset, i.e. those that take exactly two values such as 0/1 or Yes/No.

Target column (Personal Loan) distribution

So, 9.60% of the customers in the current dataset accepted the personal loan offer, while the remaining 90.40% did not.

Observation:
From the above pie chart I can see that the current dataset is heavily imbalanced towards customers not accepting the personal loan offer.
Hence the model will tend to perform better at predicting which customers will not accept the personal loan. However, my goal is to identify the customers who will accept the personal loan based on the given features.
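The class distribution can be verified with a `value_counts` call; a sketch on a toy target with roughly the 90/10 split described above:

```python
import pandas as pd

# Toy target with roughly the 90/10 split described above (assumption)
y = pd.Series([0] * 9 + [1])
proportions = y.value_counts(normalize=True)
```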

Identify correlation in data

Observation:

There is no clear relationship between ZIP Code and the other variables. There is a strong linear relationship between the Age and Experience attributes (as already discussed in the sections above). Income and CCAvg are moderately correlated; similarly, Income and Mortgage are moderately correlated.
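The correlation matrix behind these observations can be sketched as follows; the toy data below constructs Experience from Age so the two are (by construction) almost perfectly correlated:

```python
import pandas as pd

# Toy data: Experience is Age minus a constant, so the two correlate perfectly
df = pd.DataFrame({
    "Age":        [25, 30, 35, 40, 45],
    "Experience": [3,  8,  13, 18, 23],
    "Income":     [40, 90, 60, 120, 80],
})
corr = df.corr()
```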

Observation:

Personal Loan with other variables individually with Multivariate Analysis

Observation:

Observation:

Observation:

Customers with high monthly average spending on credit cards are more likely to take a loan.

Observation:

Most customers who do not have a CD Account with the bank have not taken a loan either; this is the majority. Also, almost all customers who have a CD Account with the bank have taken the loan as well.

Summary of EDA

Data Description:

Data Cleaning:

Observations from EDA:

Customer segmentation for borrowing loan based on EDA

Actions for data pre-processing:

Outliers detection

There are some very high Income values (as high as 230) compared to customers in the same age and experience group. The values for credit card spending and Mortgage look fine. After identifying outliers, we can decide whether or not to remove/treat them. Here I am not going to treat them, because there will be outliers in real-world scenarios (in Income, Mortgage value, average spending on the credit card, etc.), and I want the model to learn the underlying pattern for such customers.

Feature Engineering

Making dataframes with 'Experience' and without 'Experience'
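A minimal sketch of building the two frames (toy data; only the column name "Experience" is taken from the text):

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (assumption)
df = pd.DataFrame({"Age": [25, 40], "Experience": [3, 18], "Personal Loan": [0, 1]})

df_with_exp = df.copy()
df_without_exp = df.drop(columns=["Experience"])
```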

Model Building

Separating the target variable from the independent variables for both the with-Experience and without-Experience dataframes

Splitting the data into training and test sets in the ratio 70:30
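The 70:30 split can be sketched with scikit-learn; synthetic arrays stand in for the real features and target, and stratification (an assumption about the notebook's setup, sensible given the class imbalance noted earlier) keeps the class ratio in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the target (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = np.array([0] * 7 + [1] * 3)

# 70:30 split, stratified so both sets keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```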

Logistic Regression

With Experience Category

Without Experience Category
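A sketch of fitting the logistic regression and scoring recall on the held-out set, using synthetic imbalanced data in place of the loan features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features (assumption)
X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
recall = recall_score(y_test, model.predict(X_test))
```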

Observation:

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer will contribute to the revenue when in reality the customer would not have contributed to the revenue. - Loss of resources

  2. Predicting a customer will not contribute to the revenue when in reality the customer would have contributed to the revenue. - Loss of opportunity

Which case is more important?

How to reduce this loss i.e need to reduce False Negatives?
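One common answer, sketched here with hypothetical probabilities and labels, is to lower the classification threshold applied to the predicted probabilities: this trades some extra false positives for fewer false negatives.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (made up for illustration)
proba = np.array([0.20, 0.45, 0.60, 0.35])
y_true = np.array([0, 1, 1, 1])

# Default 0.5 threshold misses two positives (false negatives)
pred_default = (proba >= 0.5).astype(int)
fn_default = int(((y_true == 1) & (pred_default == 0)).sum())

# Lowering the threshold catches them, at the cost of more false positives
pred_low = (proba >= 0.3).astype(int)
fn_low = int(((y_true == 1) & (pred_low == 0)).sum())
```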

Improvement of the Model

K-NN
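A K-NN fit can be sketched as below on synthetic data; since K-NN is distance-based, the features are standardized first (a standard practice, assumed here rather than quoted from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the loan features (assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Standardize, then classify by the 5 nearest neighbours
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
```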

Decision:

Model building Decision Tree

Build Model

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model
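A sketch of the grid search on synthetic data; the parameter grid and the choice of recall as the scoring metric are assumptions consistent with the evaluation criterion discussed above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan features (assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Hypothetical grid; the real notebook may tune different values
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize for recall, since false negatives are costly
    cv=3,
)
grid.fit(X, y)
best_tree = grid.best_estimator_
```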

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Observation:

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Next, I will train a decision tree for each of the effective alphas, passing each value to the ccp_alpha parameter of DecisionTreeClassifier. By looping over the alphas array, I will find the accuracy on both the train and test parts of the dataset.
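This loop can be sketched as follows on synthetic data; `cost_complexity_pruning_path` is the scikit-learn API that yields the effective alphas, and the tree fitted with the largest alpha collapses to a single root node:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan features (assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Effective alphas computed from the training data
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = path.ccp_alphas

# One tree per alpha, scored on both splits
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in alphas
]
train_scores = [t.score(X_train, y_train) for t in trees]
test_scores = [t.score(X_test, y_test) for t in trees]
```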

The maximum value of recall is at alpha = 0.06, but at that value the decision tree would have only a root node and I would lose the business rules, so I am going to take alpha = 0.02, where the train and test recall are the same.

Creating model with 0.02 ccp_alpha

Visualizing the Decision Tree

Observation:

I am getting a higher recall on test data for alpha between 0.002 and 0.005, so I will choose alpha = 0.002. The recall on train and test indicates I have created a generalized model, with 96% accuracy and reduced false negatives. Important features: Income, Graduate education, Family size of 3 and 4, CCAvg, Advanced education, Age. This is the best model, as there are only 6 false negatives on the test data.

Comparing all the decision tree models

Conclusion:

Recommendations